derived is more sensitive in [illegible]. The method [illegible] described in the Atlas, Volume 3 and Volume 5.



Accepted Point Mutations

An accepted point mutation in a protein is a replacement of one amino acid by another, accepted by natural selection. [illegible] as the new predominant form. To be accepted, the new amino acid usually must function in a way similar to the old one: chemical and physical similarities are found between the amino acids that are observed to interchange frequently.

Any complete discussion of the observed behavior of amino acids in the evolutionary process must consider the frequency of change of each amino acid to each other one [illegible]. There are 20 X 20 = 400 possible comparisons. To collect a useful [illegible] proteins appearing in the Atlas volume through Supplement 2. [illegible]

[illegible] Consider, for example, the much simplified artificial phylogenetic tree of Figure 78. The matrix of accepted point mutations [illegible] is shown in Figure 79. We have [illegible] of this assumption, no change in [illegible] over evolutionary distance will be detected.

By comparing observed sequences with inferred ancestral sequences, rather than with each other, a [illegible] picture of the acceptable point mutations is obtained. In [illegible] the first amino acid [illegible] but C and D do [illegible] have treated the changes [illegible].

The total numbers of accepted point mutations observed between closely related sequences from 34 superfamilies, grouped into 71 evolutionary trees, are shown in Figure 80. In order to minimize the occurrence of changes caused by successive accepted mutations at one site, the sequences within a tree were [illegible]; ancestral sequences would be even closer. Of the 190 possible exchanges, [illegible] usually involved the amino acids that are not highly mutable and exchanges where more than one nucleotide of the codon must change. Of [illegible], the largest number, 83, was observed between Asp and Glu, two chemically very similar amino acids whose codons differ by one nucleotide. About 20% of the interchanges, far more than one would expect for such similar sequences, involved amino acids whose codons differed by more than one nucleotide. Presumably, in these cases, changes at some of the amino acid positions are rejected by selection and multiple changes at other sites are favored. Many of the changes expected from mutations of one nucleotide in a codon are seldom observed. Presumably these mutations have occurred often but have been rejected by natural selection acting on the proteins. For example, there were no exchanges between Gly and Trp.

Mutability of Amino Acids

A complete picture of the mutational process must include a consideration of the amino acids that did not change [illegible]. [illegible] the number of times that it has [illegible] and thus has been subject to mutation. The relative mutability of each amino acid is proportional to the ratio of changes to occurrences. Figure 81 illustrates [illegible] simple case in which B changes relatively often, and D never.
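The ratio of changes to occurrences can be sketched in a few lines. This is only an illustration under stated assumptions: the function name `relative_mutabilities` and the toy aligned pair `ADA`/`ADB` are hypothetical (chosen to agree with the counts quoted for Figure 81), and the weighting over many trees used for the real tabulation is not reproduced.

```python
from collections import Counter

def relative_mutabilities(seq_a, seq_b):
    """Changes per occurrence for each residue in two aligned sequences.

    Hypothetical helper: a change at a position is credited to both
    residues involved, and occurrences are counted over both sequences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to equal length")
    occurrences = Counter(seq_a) + Counter(seq_b)
    changes = Counter()
    for a, b in zip(seq_a, seq_b):
        if a != b:
            changes[a] += 1
            changes[b] += 1
    # Counter returns 0 for residues that never changed.
    return {aa: changes[aa] / occurrences[aa] for aa in occurrences}

# Toy aligned pair (an assumption, not data from the text).
mutability = relative_mutabilities("ADA", "ADB")
```

With this toy pair, B changes once in one occurrence, A once in three, and D never, which is the qualitative pattern the text describes.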



Table 21

Relative Mutabilities of the Amino Acids(a)

Asn   134        His   66
Ser   120        Arg   65
Asp   106        Lys   56
Glu   102        Pro   56
Ala   100        Gly   49
Thr    97        Tyr   41
Ile    96        Phe   41
Met    94        Leu   40
Gln    93        Cys   20
Val    74        Trp   18

(a) The value for Ala has been arbitrarily set at 100.



Figure 81. Computation of relative mutabilities. The two aligned sequences may be two observed sequences or an observed sequence and its inferred ancestor.

Aligned sequences                            [illegible]
Amino acids                                  A      B      D
Changes                                      1      1      0
Frequency of occurrence (total composition)  3      1      2
Relative mutability                          .33    1      0



[illegible] trees considered. The denominator is [illegible] by the total number of mutations per [illegible] in Table 21. On the average, [illegible].

The immutability of cysteine is understandable. Cysteine is known to have several unique, indispensable functions. It is the attachment site of [illegible] in cytochrome and of FeS clusters in [illegible]. It forms cross-links in other proteins such as [illegible] or ribonuclease. It seldom occurs without having an important function. [illegible] properties. On the average [illegible] is highly mutable.

Amino Acid Frequencies in the Mutation Data

[illegible] amino acids is shown in Table 22. [illegible] for the tree. The sum of the frequencies is 1.

Mutation Probability Matrix for the Evolutionary Distance of One PAM

We can now combine information about the [illegible] of mutations and about the [illegible] the amino acids into one distance-dependent [illegible] a given evolutionary interval, in this case 1 PAM.



Table 22

Normalized Frequencies of the Amino Acids in the Accepted Point Mutation Data

Gly   0.089        Arg   0.041
Ala   0.087        Asn   0.040
Leu   0.085        Phe   0.040
Lys   0.081        Gln   0.038
Ser   0.070        Ile   0.037
Val   0.065        His   0.034
Thr   0.058        Cys   0.033
Pro   0.051        Tyr   0.030
Glu   0.050        Met   0.015
Asp   0.047        Trp   0.010
where

$A_{ij}$ is an element of the accepted point mutation matrix of Figure 80,

$\lambda$ is a proportionality constant, and

$m_j$ is the mutability of the $j$th amino acid, Table 21.

The diagonal elements have the values:

$M_{jj} = 1 - \lambda m_j$

Consider a typical column, that for alanine. The total probability, the sum of all the elements, must be 1. The probability of observing a change in a site containing alanine is proportional to the mutability of alanine. The same proportionality constant, $\lambda$, holds for all columns. [illegible] chosen so that this change is 1 mutation. Since there are almost no superimposed changes, this also represents 1 PAM of change. If $\lambda$ had been four times as large, the matrix would have represented 4 PAMs; [illegible]

[Figure 82 caption, fragment:] [illegible] probability that the amino acid in column j will [illegible]
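A small numerical sketch may help fix the construction. It assumes the nondiagonal elements take the form $M_{ij} = \lambda m_j A_{ij} / \sum_i A_{ij}$, which is the form that makes each column sum to 1 together with the diagonal $M_{jj} = 1 - \lambda m_j$; that formula, the function name, and the three-letter toy data are assumptions for illustration, not values from Figure 80 or Tables 21 and 22. Here $\lambda$ is chosen so that, on average, one residue in 100 changes.

```python
import numpy as np

def mutation_probability_matrix(A, m, f, expected_change=0.01):
    """Build a column-stochastic matrix from counts A, mutabilities m, frequencies f.

    A must be symmetric with a zero diagonal. The scaling constant is chosen
    so that sum_j f[j] * lam * m[j] equals expected_change (0.01 for 1 PAM).
    """
    A = np.asarray(A, dtype=float)
    m = np.asarray(m, dtype=float)
    f = np.asarray(f, dtype=float)
    lam = expected_change / np.sum(f * m)
    col_sums = A.sum(axis=0)
    # Broadcasting scales column j by lam * m[j] / col_sums[j].
    M = lam * m * A / col_sums
    np.fill_diagonal(M, 1.0 - lam * m)
    return M, lam

# Toy three-letter alphabet (hypothetical numbers).
A = [[0, 3, 1],
     [3, 0, 2],
     [1, 2, 0]]
m = [1.0, 0.5, 0.2]
f = [0.5, 0.3, 0.2]
M, lam = mutation_probability_matrix(A, m, f)
```

Every column of `M` sums to 1, and the composition-weighted diagonal is 0.99, that is, 99 of 100 residues unchanged in the toy case.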
Simulation of the Mutational Process



For evaluating statistical methods of detecting relationships, for developing methods of measuring evolutionary distances between proteins, and for estimating the accuracy of programs to construct evolutionary trees, we need to have examples of proteins at known evolutionary distances. The mutation probability matrix provides the information with which to simulate any amount of evolutionary change in an unlimited number of proteins. Further, we can start with one protein and simulate its separate evolution in duplicated genes or in divergent organisms. By considering many groups of sequences related by the same evolutionary history, an estimate is readily obtained of the expected deviations due to random fluctuations in the evolutionary process.

If we only require that, on the average, one mutation takes place in the evolutionary interval of 1 PAM, we can use a simulation requiring one random number for each amino acid in the sequence, as follows: To determine the fate of the first amino acid, say Ala, a uniformly distributed random number between 0 and 1 is obtained. The first column of the mutation probability matrix (Figure 82) gives the relative probability of each possible event that may befall Ala (neglecting deletion for the moment). If the random number falls between 0 and .9867, Ala is left unchanged. If the number is between .9867 and .9868, it is replaced with Arg; if it is between .9868 and .9872, it is replaced with Asn, and so forth. Similarly, a random number is produced for each amino acid in the sequence, and action is taken as dictated by the corresponding column of the matrix. The result is a simulated mutant sequence. Any number of these can be generated; their average distance from the original is 1 PAM, although some may have no mutations and some may have two or more. The effects on the sequence of a longer period of evolution may be simulated by successive applications of the matrix to the sequence resulting from the last application.

For simulations in which a predetermined number of changes are required, a two-step process involving two random numbers for each mutation can be used. Starting with a given sequence, the first amino acid that will mutate is selected: the probability that any one will be selected is proportional to its mutability (Table 21). Then the amino acid that replaces it is chosen. The probability for each replacement is proportional to the elements in the appropriate column of Figure 82. Starting with the resultant sequence, a second mutation can be simulated, and so on, until a predetermined number of changes have been made. In this process, superimposed and back mutations may occur.
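The first procedure, one random number per residue, can be sketched as follows. This is a minimal illustration, not the program used for the original simulations: the function name `mutate_once`, the two-letter alphabet, and the toy matrices are hypothetical. The column for each residue is laid out as cumulative intervals with the "unchanged" outcome first, as in the Ala example.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def mutate_once(sequence, alphabet, M, rng):
    """Apply one evolutionary interval of the matrix M to every residue.

    M[i][j] is the probability that the residue in column j is replaced
    by the residue in row i; each column is assumed to sum to 1.
    """
    index = {aa: j for j, aa in enumerate(alphabet)}
    out = []
    for aa in sequence:
        j = index[aa]
        # Outcome order: unchanged first, then every other residue.
        order = [j] + [i for i in range(len(alphabet)) if i != j]
        cumulative = list(accumulate(M[i][j] for i in order))
        r = rng.random()  # uniform in [0, 1)
        # Guard against rounding that leaves the last bound just under 1.
        k = min(bisect_right(cumulative, r), len(order) - 1)
        out.append(alphabet[order[k]])
    return "".join(out)

# Toy matrices (assumptions): the identity, and one in which A always becomes B.
identity = [[1.0, 0.0], [0.0, 1.0]]
always_b = [[0.0, 0.0], [1.0, 1.0]]
unchanged = mutate_once("AABA", "AB", identity, random.Random(0))
all_b = mutate_once("AABA", "AB", always_b, random.Random(0))
```

Calling `mutate_once` repeatedly on its own output corresponds to the successive applications described above; the two-step procedure for a fixed number of changes would need a separate site-selection step weighted by mutability.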



The 1 PAM matrix can be multiplied by itself N times to yield a matrix that predicts the amino acid replacements to be found after N PAMs of evolutionary change in a sequence of average composition. On the average, the results of the simulations above match the predictions of the corresponding matrices.
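Repeated multiplication is easy to try numerically. The sketch below uses a hypothetical two-state matrix in place of the 20 x 20 matrix of Figure 82; `pam_matrix` is an illustrative name, and `numpy.linalg.matrix_power` does the repeated multiplication.

```python
import numpy as np

def pam_matrix(M1, n):
    """Return the n-fold product of the one-interval matrix M1 with itself."""
    return np.linalg.matrix_power(np.asarray(M1, dtype=float), n)

# Toy column-stochastic matrix (an assumption, not Figure 82).
M1 = [[0.99, 0.02],
      [0.01, 0.98]]
M250 = pam_matrix(M1, 250)
M_far = pam_matrix(M1, 2000)
```

In this toy case the columns of `M250` still sum to 1, and at a large exponent every column approaches the same vector, (2/3, 1/3), which is the composition left unchanged by `M1`.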



Mutation Probability Matrices for Other Distances

The mutation probability matrix $M_1$, corresponding to 1 PAM, has a number of interesting properties (see Figure 82). If, in a simulation, it is applied to a protein with the average amino acid composition given in Table 22, on the average the composition of the resulting mutated proteins will be unchanged. Repeated applications of the matrix to proteins of any other composition will give mutants that change toward average composition; any such matrix has implicit in it some particular asymptotic composition.

There is a different mutation probability matrix for each evolutionary interval. These can be derived from the one for 1 PAM by matrix multiplication. If the 1 PAM matrix is multiplied by itself an infinite number of times, each column of the resulting matrix approaches the asymptotic amino acid composition:



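$$
\begin{pmatrix}
f_A & f_A & f_A & f_A & \cdots \\
f_R & f_R & f_R & f_R & \cdots \\
f_N & f_N & f_N & f_N & \cdots \\
\vdots & \vdots & \vdots & \vdots &
\end{pmatrix}
$$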

At a great distance, there is very little relationship information left in the matrix. For example, at a distance of 2,034 PAMs all of the matrix values are within 5% of their limiting values except for the Trp-Trp element, which is 75% higher than the limit, and the Cys-Cys element, which is 11% higher.

The matrix for 0 PAMs is simply a unit diagonal; no amino acid would have changed:

$$
M_0 =
\begin{pmatrix}
1 & 0 & 0 & \cdots \\
0 & 1 & 0 & \cdots \\
0 & 0 & 1 & \cdots \\
\vdots & \vdots & \vdots & \ddots
\end{pmatrix}
$$



The mutation probability matrix for 250 PAMs is shown in Figure 83. At this evolutionary distance, only one amino acid in five remains unchanged. However, the amino acids vary greatly in their mutability; 55% of the tryptophans, 52% of the cysteines, and 27% of the glycines would still be unchanged, but only 6% of the highly mutable asparagines would remain. Several other amino acids, particularly alanine, aspartic acid, glutamic acid, glycine, lysine and serine, are more likely to occur in place of an original asparagine than asparagine itself at this evolutionary distance! This is understandable from the data giving the preferred mutations and the relative mutabilities. Asparagine is highly mutable, therefore it changes to other amino acids. These are less mutable and may not change again. This effect is much more conspicuous in the case of methionine. Surprisingly, a methionine originally present would have changed to leucine in 20% of the cases, but would remain methionine in only 6%. Over one-third of the mutations in methionine are specifically to leucine (Figure 80). Leucine is less than one-half as mutable as methionine (Table 21).



From the series of distance-dependent mutation prob- 
ability matrices, we can compute detailed answers to the 
question "How does the evolutionary process affect the 
similarity of related protein sequences?" 

Estimation of Evolutionary Distance 

There is a different mutation probability matrix for 
each evolutionary interval measured in PAMs. For each 
such matrix, we can calculate the percentage of amino 
acids that will be observed to change on the average in 
the interval by the formula: 

$$100\left(1 - \sum_i f_i M_{ii}\right)$$
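The formula and its use as a distance scale can be sketched as follows. The two-state matrix and composition are hypothetical stand-ins for Figure 82 and Table 22, and `pam_distance_for` is an illustrative helper that simply steps through successive PAM matrices until the predicted difference reaches the observed one; it is not claimed to be the procedure used to prepare the published scale.

```python
import numpy as np

def percent_difference(M, f):
    """Expected percentage of positions that differ: 100 * (1 - sum_i f_i * M_ii)."""
    M = np.asarray(M, dtype=float)
    f = np.asarray(f, dtype=float)
    return 100.0 * (1.0 - float(np.sum(f * np.diag(M))))

def pam_distance_for(observed_percent, M1, f, max_pam=2000):
    """Smallest number of PAMs whose predicted difference reaches the observed one."""
    M1 = np.asarray(M1, dtype=float)
    M = np.eye(len(f))
    for n in range(max_pam + 1):
        if percent_difference(M, f) >= observed_percent:
            return n
        M = M1 @ M
    return None  # the observed difference exceeds what this matrix can produce

# Toy data (assumptions).
M1 = [[0.99, 0.02],
      [0.01, 0.98]]
f = [2 / 3, 1 / 3]
one_step = percent_difference(M1, f)
```

Because superimposed changes hide earlier ones, the predicted difference grows more slowly than the number of PAMs and levels off; in the toy case it cannot exceed about 44%, so a request for 50% returns `None`.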

Table 23 shows the correspondence between the observed percent difference between two sequences and the evolutionary distance in PAMs. We use this scale to estimate
Figure 83. Mutation probability matrix for the evolutionary distance of 250 PAMs. To simplify the appearance, the elements are shown multiplied by 100. In comparing two sequences of average amino acid frequency at this evolutionary distance, there is a 13% probability that a position containing Ala in the first sequence will contain Ala in the second. There is a 3% chance that it will contain Arg, and so forth. The relationship of two sequences at a distance of 250 PAMs can be demonstrated by statistical methods.
Table 23

Correspondence between Observed Differences and the Evolutionary Distance

Observed Percent     Evolutionary Distance
Difference           in PAMs

      1                    1
      5                    5
     10                   11
     15                   17
     20                   23
     25                   30
     30                   38
     35                   47
     40                   56
     45                   67
     50                   80
     55                   94
     60                  112
     65                  133
     70                  159
     75                  195
     80                  243
     85                  328



evolutionary distances from matrices of percent difference between sequences. The estimated distances are used in the computations of evolutionary trees in this book. [illegible] difference predicted for a given PAM distance [illegible] by up to [illegible] from those that we reported previously. A more complete scale is given in Table 38 of the Appendix.

Relatedness Odds Matrix 
The elements, $M_{ij}$, of the mutation probability matrix for each distance give the probability that amino acid $j$ will change to $i$ in a related sequence in that interval. The normalized frequency $f_i$ gives the probability that $i$ will occur in the second sequence by chance.

The terms of the relatedness odds matrix are then:

$$R_{ij} = \frac{M_{ij}}{f_i}$$

The odds matrix is symmetrical. Each term gives the probability of replacement per occurrence of $i$ per occurrence of $j$.



Amino acid pairs with scores above 1 replace each other more often as alternatives in related sequences than in random sequences of the same composition; those with scores below 1 replace each other less often.

The information in the 250-PAM matrix has proven very useful in detecting distant relationships between sequences. When one protein is compared with another, position by position, one should multiply the odds for each position to obtain an odds for the whole protein. However, it is more convenient to add the logarithms of the matrix elements. The log of the 250-PAM odds matrix is shown in Figure 84.
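The step from odds to additive scores can be sketched briefly. The two-state matrix and composition are hypothetical, the function names are illustrative, and base-10 logarithms are an assumption here (they are consistent with a score of +2 on a tenfold scale corresponding to odds of about 1.6).

```python
import math

def odds_matrix(M, f):
    """R[i][j] = M[i][j] / f[i]: odds of the pair relative to chance."""
    n = len(f)
    return [[M[i][j] / f[i] for j in range(n)] for i in range(n)]

def log_odds_score(seq_a, seq_b, alphabet, R):
    """Sum of log10 odds over aligned positions (all odds must be positive)."""
    index = {aa: k for k, aa in enumerate(alphabet)}
    return sum(math.log10(R[index[a]][index[b]]) for a, b in zip(seq_a, seq_b))

# Toy data (assumptions): f is the composition left unchanged by M.
M = [[0.99, 0.02],
     [0.01, 0.98]]
f = [2 / 3, 1 / 3]
R = odds_matrix(M, f)
score = log_odds_score("AB", "AA", "AB", R)
```

Adding logarithms gives the same ranking as multiplying the odds position by position, and in this toy case `R` comes out symmetric, as the text states for the real matrix.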

The Chemical Meaning of Amino Acid Mutations

Patterns have been visible in the [illegible] since the beginning of protein [illegible]. [illegible] isoleucine-valine and serine-threonine were frequently [illegible] had something to do with their chemical similarities. With the large amount of information that now exists, more detailed correlations are visible, and many more functional inferences can be made.

In the log odds matrix of Figure 84, the order of the amino acids has been rearranged to [illegible] groups of chemically similar amino acids [illegible] replace one another: the hydrophobic group; the [illegible] group; the basic group; the acid, [illegible]; and the other hydrophilic [illegible] overlap: the basic and acid, [illegible] replace one another to some extent, and [illegible] interchanges with the hydrophobic group more often than chance expectation would predict. [illegible] imposed principally by natural selection and only secondarily by the constraints of the genetic code; they reflect the similarity of the functions of the amino acid residues in their weak interactions with one another in the three-dimensional conformation of proteins [illegible] bonds, and hydrogen bonds.

Computing Relationships between Sequences

We use the log odds matrix as scoring matrices [illegible] distant relationships between [illegible]. Scoring matrices based ultimately on accepted point mutations discriminate significant relationships from random coincidences better than simpler scoring systems. Mere counts of identities and matrices based only on the changes predicted by the genetic code are not sufficiently complex. It is obvious that there is a good deal of information in the detailed nature of both the nonidentities and the identities. Certain combinations of different amino acids are positive evidence of relatedness, and others are contraindications. The log odds matrix for 250 PAMs, which we have found to be a very effective scoring matrix for detecting distant relationships, is compared with other matrices in chapter 23.

Figure 84. Log odds matrix for 250 PAMs. Elements are shown multiplied by 10. The neutral score is zero. A score of -10 means that the pair would be expected to occur only one-tenth as frequently in related sequences as random chance would predict, and a score of +2 means that the pair would be expected to occur 1.6 times as frequently. The order of the amino acids has been arranged to illustrate the patterns in the mutation data.
23 Matrices for Detecting Distant Relationships

R.M. Schwartz and M.O. Dayhoff


When two proteins descended from a common ancestor have accumulated a large number of amino acid changes in their sequences, it is difficult to establish their common origin. By a careful selection of methods, it is possible to detect very distant relationships in comparisons that rely on statistical evaluations of similarity between the sequences. These methods all depend on the comparison of an amino acid in one sequence with a corresponding amino acid in another sequence. Each [illegible] accumulated over a string of residues to [illegible]. In this chapter we evaluate and compare several scoring systems.

The simplest system assigns a value of +1 to identical residues and 0 to nonidentical ones; the scoring matrix corresponding to this system is called the unitary matrix (UM). A slightly more complicated scoring system, reflecting the minimum number of base changes required to alter the codon for one amino acid to a codon for another, assigns 3 for amino acid identities, 2 for amino acids whose codons differ by a single base, 1 for amino acids whose codons differ by two bases, and 0 for amino acids whose codons differ in all three bases; we refer to this as the genetic code matrix (GCM). In 1971, a matrix based on alternative amino acids (AAAM) at each position in alignments of groups of related sequences was derived by McLachlan. In 1967, we derived a scoring matrix, called the mutation data matrix (MDM67), from 421 accepted point mutations observed in closely related sequences available then. In 1969, we recalculated the mutation data matrix (MDM69) on the basis of 814 accepted point mutations. These matrices were derived in essentially the same manner described in chapter 22.
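The two simplest scoring systems are easy to state in code. The sketch below derives genetic-code scores from the standard codon table rather than copying a published matrix, so it is an illustration of the rule in the text, not a reproduction of the authors' GCM; one-letter amino acid codes and the function names are conventions chosen here.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... ('*' = stop).
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODONS = ["".join(c) for c in product(BASES, repeat=3)]

codons_for = {}
for codon, aa in zip(CODONS, AMINO):
    codons_for.setdefault(aa, []).append(codon)

def um_score(a, b):
    """Unitary matrix: 1 for identical residues, 0 otherwise."""
    return 1 if a == b else 0

def gcm_score(a, b):
    """3 minus the minimum number of base changes between any codons of a and b."""
    fewest = min(
        sum(x != y for x, y in zip(c1, c2))
        for c1 in codons_for[a]
        for c2 in codons_for[b]
    )
    return 3 - fewest
```

For example, `gcm_score("D", "E")` is 2 because Asp and Glu codons differ by a single base, while `gcm_score("M", "Y")` is 0 because every Met/Tyr codon pair differs at all three positions.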

In our experience, scoring systems representing the average way in which amino acids change during evolution have proved most satisfactory for detecting relationships between protein sequences. On the basis of data available through Supplement 2, including 1,572 mutations, we have again recalculated the mutation probability matrices, the odds matrices, and the log odds matrices for various evolutionary distances (see chapter 22). Presumably, in detecting relationships, the best results would be obtained with a matrix corresponding to the same evolutionary distance as that between the sequences being compared. Because we are most interested in [illegible] very distantly related sequences, we will concentrate on matrices that are calculated at large evolutionary distances. In Figure 85, we show the log odds matrix for 250 PAMs to two significant figures; this [illegible] corresponds to sequences that are about 80% different. In order to establish its superiority, [illegible] 250-PAM matrix with the new mutation data matrices at other evolutionary distances, with other scoring matrices, and with partial information from the matrix itself, using two different computer methods. This extends work reported previously.(5,6)



Measurements of Similarity between Sequences

We currently use two statistical computer methods for measuring the extent of similarity between sequences, programs ALIGN and RELATE, described in chapter [illegible]. ALIGN determines the maximum score that can be achieved by an alignment of a pair of sequences and compares that score with the scores achieved by random permutations of the two sequences. The alignment score is the difference between the [illegible] for the real [illegible] and the average score from the randomized sequences, divided by the standard deviation of the scores from the randomized sequences. In order to [illegible] dependence of alignment scores on evolutionary [illegible], we have constructed a model sequence of 100 residues having an average amino acid composition. [illegible] this sequence, a family of other sequences with known numbers of point mutations was generated by the process described in chapter 22. Figure 86 shows the results of these simulation experiments.(7) If we take an alignment score of 3.0 SD as an indication of probable relatedness, these simulations suggest that relatedness to the initial sequence can be demonstrated for sequences that have accumulated 550 mutations in 100 residues and are nearly 88% different. In real comparisons there are additional problems, such as differences in length due to insertions and deletions of genetic material and the nonaverage behavior of amino acids in any particular molecule. Nevertheless, we can usually detect relationships between real sequences of 100 residues that are 85% different.

Figure 85. Scoring matrix for the evolutionary distance of 250 PAMs. This is the log of the odds matrix; elements are shown multiplied by 100. We refer to this mutation data scoring matrix as MDM78. We have detected no statistically meaningful difference in the results using this matrix and those using the matrix in Figure 84, which has one significant figure less.

Figure 86. [illegible] mutations. Pairs of model protein sequences of [illegible] average amino acid composition were used. A score above 3.0 SD, [illegible] reflecting up to 550 mutations per 100 residues, is considered good [illegible]. [illegible] percent difference between sequences, although a good measure [illegible] short distances, deteriorates rapidly and approaches [illegible]. This figure is adapted from Ref. 7.

The other statistical computer program, RELATE, makes an exhaustive comparison of all segments of a given length from one sequence with those from the other. The average value of a preassigned number of the highest scores is determined. This average value is compared with the distribution of such average values from [illegible]. The difference between [illegible] values from the randomized sequences. This program [illegible] sequences that are very different in length or [illegible] certain portions of their sequences conserved. [illegible] also be used to detect internal duplications [illegible] sequences. The alterations in this algorithm that are necessary for detecting internal duplications in a [illegible] are obvious: instead of comparing a sequence with a [illegible] one, it is compared with itself. Exactly corresponding segments are excluded from the analysis.
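The segment comparison idea and the conversion of a raw score into SD units can be sketched as follows. This is a simplified illustration, not the RELATE program itself: the function names are hypothetical, the scoring function is passed in by the caller, and the randomized scores in the example are made-up numbers.

```python
import statistics

def segment_scores(seq_a, seq_b, length, score):
    """Score every segment of `length` from seq_a against every one from seq_b."""
    scores = []
    for i in range(len(seq_a) - length + 1):
        for j in range(len(seq_b) - length + 1):
            scores.append(
                sum(score(seq_a[i + k], seq_b[j + k]) for k in range(length))
            )
    return scores

def top_average(scores, how_many):
    """Average of the `how_many` highest segment scores."""
    best = sorted(scores, reverse=True)[:how_many]
    return sum(best) / len(best)

def sd_score(real_value, randomized_values):
    """(real - mean of randomized) / standard deviation of randomized."""
    mean = statistics.mean(randomized_values)
    sd = statistics.stdev(randomized_values)
    return (real_value - mean) / sd

# Toy example with identity scoring (an assumption for illustration).
unitary = lambda a, b: 1 if a == b else 0
scores = segment_scores("ABCD", "ABCD", 2, unitary)
best = top_average(scores, 3)
```

To look for an internal duplication one would pass the same sequence twice and skip the pairs with `i == j`, following the text's remark that exactly corresponding segments are excluded; the same `sd_score` form applies to an alignment score compared with scores from randomized sequences.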

Effect of Evolutionary Distance on the Mutation Data Matrices

In order to examine the effect of the mutation distance on which the scoring matrix is derived, we chose [illegible] pairs of related sequences, ranging between [illegible] and 86% different from one another. Each pair was [illegible]. Figure 87 shows how the alignment [illegible] for which the mutation [illegible] computed. The optimum choice of scoring matrix is sequence-dependent; it is a function of [illegible] between the proteins [illegible] choice over this range of protein comparisons. [illegible] curves are near their maxima and have values [illegible].

[illegible] how the segment comparison scores [illegible] scoring matrix is computed. Four pairs of sequences are identical with those tested using alignment scores. [illegible] scores are almost 2 SD lower, confirming that alignment scores are more sensitive for sequences of similar length and architecture. The matrix calculated at [illegible] the result obtained with ALIGN, despite the [illegible], such as gap penalty, bias, and segment comparison [illegible] employed in these two programs. We refer to the 250-PAM mutation data scoring matrix as MDM78.




Figure 87. [illegible] of the mutation data matrices [illegible] 150, 200, [illegible] deviations of the scores [illegible] following comparisons: [illegible]bin alpha chain - human vs. [illegible]globin - human; [illegible] cytochrome c - horse vs. [illegible] human Nd; Ig mu chain C4 homology region - Human Gal vs. beta2-microglobulin - human.

[Caption of the following figure, fragment:] [illegible] calculated at 4, 30, 100, 150, 200, [illegible] Figure 87 [illegible] fails to be above 3 SD with any matrix.
358 ATLAS OF PROTEIN SEQUENCE AND STRUCTURE 1978

Comparison of Scoring Matrices

We have compared a number of scoring matrices, using ALIGN. The results of these comparisons, involving a broad selection of pairs of related sequences, are listed in Table 24. MDM78 gives the highest average score, although it does not always give the highest score for a particular comparison. It is the only matrix that consistently detects relatedness (scores > 3.0 SD) for the entire range of sequences tested. In the comparison of antibacterial substance A with neocarzinostatin, GCM and UM give better scores than MDM78. This may be due to the conservation of what are usually more mutable amino acids. In the other comparisons, matrices based on mutation data perform better, and MDM78 usually gives the strongest indication of relatedness. The average scores are shown at the bottom of the table. The score using MDM78 is 1 SD better than that using AAAM, 2 SD better than that using GCM and almost 3 SD better than that using UM.

Table 25 shows segment comparison scores for a broad range of sequences, including tests for internal duplications, using different scoring matrices. Again, on occasion another matrix gives a better score, but only MDM78 consistently indicates known relationships between sequences; of the scoring matrices that we tested it is clearly the best. The average score using it is 2.5 SD better than that for any of the other matrices.

In order to ascertain whether either ALIGN or RELATE produces false-positive results with any of the scoring matrices we tested, we examined 28 pairs of unrelated proteins. Neither program gave false-positive results using any of the matrices. The mean alignment score for the 28 comparisons was between 0.2 and -0.2 for all four matrices. The mean segment comparison score for the 28 pairs was between 0.3 and -0.4 for all four matrices. All of these trials were based on 100 randomized sequence comparisons.

Comparison of MDM78 with Its Predecessors

Using a variety of distantly related sequences we have compared the results using the recently derived MDM78, the two previous mutation data matrices, MDM67 and MDM69, based on one-fourth and one-half as much data respectively, and components of MDM78: the diagonal elements alone, with all off-diagonal elements equal to zero, and the off-diagonal elements, with the diagonal


Table 24

Comparison of Matrices for Calculating Alignment Scores

                                                      Score (in SD units) Obtained with
Sequences Compared                                    UM      GCM     AAAM    MDM78

Antibacterial substance A - Streptomyces vs.
  Neocarzinostatin - Streptomyces                     3.1     3.2     2.6     2.9
Ferredoxin - Clostridium pasteurianum vs.
  Ferredoxin - Spirulina maxima                       0.1     1.6     1.8     3.4
Hemoglobin alpha - Human vs. Myoglobin - Human        5.8     6.6     9.9    10.7
Hemoglobin alpha - Human vs.
  Globin CTT-III - Midge larva                        2.0     2.4     3.2     3.5
Cytochrome c - Horse vs.
  Cytochrome c6 - Spirulina                           4.5     4.3     7.3     6.1
Cytochrome c - Horse vs.
  Cytochrome c55[illegible] - Desulfovibrio           0.2     0.4     0.4     3.9
Beta2-microglobulin - Human vs.
  Ig mu chain C4 homology region - Human Gal          3.6     3.3     4.7     4.8
Ig mu chain C4 homology region - Human Gal vs.
  Ig epsilon chain C4 homology region - Human Nd      4.7     9.0     9.2    12.1

Average score                                         3.0     3.9     4.9     5.9

In these comparisons, we used values for the gap penalty (P) and the matrix bias (B) that have been useful for a broad range of sequence comparisons in our experience: [illegible] and 60 for [illegible] and 1 for GCM, and 0 [illegible]. [illegible] comparison of antibacterial substance A with neocarzinostatin, a bias of 20 and a penalty of 80 were used with [illegible]; these are more typical choices for detecting very distant sequence relationships. In the comparisons using AAAM, for which our experience is limited, we varied B from -2 to +4; P was chosen to be 6 and 8. These values produced alignments that were similar in [illegible] and gap length to alignments using the other [illegible] B = -2. Three hundred randomized sequence comparisons [illegible] used in determining scores for AAAM and MDM78; [illegible] percent standard deviations [illegible] 4%. UM and GCM scores were calculated [illegible] sequence comparisons; thus, the estimated percent standard deviations of these scores are 7%.
Table 25

Comparison of Matrices for Calculating Segment Comparison Scores

                                                      Score (in SD units) Obtained with
Sequences Compared                                    UM      GCM     AAAM    MDM78

Cytochrome c6 - Monochrysis vs.
  Cytochrome [illegible] - Rhodospirillum             4.7     3.1     2.5     3.5
Azurin - Bordetella vs.
  Plastocyanin - French bean                          1.6     2.8     3.1     4.1
Ferredoxin - Clostridium pasteurianum vs.
  Ferredoxin - Desulfovibrio                          3.9     3.1     4.4     6.0
Troponin C - Rabbit vs. Parvalbumin - Pike            7.6     6.3     8.0    10.2
Troponin C - Rabbit vs.
  Myosin A1 light chain - Rabbit                      8.0     9.3     6.7    15.1

Internal Duplication
Tropomyosin alpha chain - Rabbit                      5.9     4.0     3.6     8.3
Protease inhibitor, submandibular gland - Dog         4.1     3.6     5.3     7.9
Cytochrome c3 - Desulfovibrio gigas                   0.5     1.3     0.7     3.9
Ferredoxin - C. pasteurianum                          7.8     [illegible]  7.1     7.7

Average score                                         4.9     4.6     4.6     7.4

[illegible] in the cytochrome c3 and in the ferredoxin internal duplication [illegible] other comparisons we used a [illegible] of 20 [illegible]. [illegible] hundred randomized sequence comparisons were used in calculating scores for AAAM and MDM78; thus, the percent [illegible] for these scores are 4%. One hundred [illegible] used for UM and GCM; thus, their percent standard deviations are 7%.



Table 26

Comparison of Mutation Data Matrices for Calculating Alignment Scores

                                                Scores (in SD units) Obtained with
                                                                                        Off-diagonal
                                                                          Diagonal      and Averaged
                                                                          Only          Diagonal
Sequences Compared                              MDM67   MDM69   MDM78   UM     MDM78     MDM78

Antibacterial substance A - Streptomyces vs.
  Neocarzinostatin - Streptomyces                2.0    [illegible]  2.9   3.1    1.4       1.9
Ferredoxin - Clostridium pasteurianum vs.
  Ferredoxin - Spirulina maxima                  2.6     2.6     3.4    0.1    2.7       2.7
Hemoglobin alpha - Human vs.
  Myoglobin - Human                              9.9     9.7    10.7    5.8    9.9      10.[illegible]
Hemoglobin alpha - Human vs.
  Globin CTT-III - Midge larva                   2.6     2.4     3.5    2.0    0.9       3.5
Cytochrome c - Horse vs.
  Cytochrome c6 - Spirulina                      9.6     5.4     6.1    4.5    5.6       5.8
Cytochrome c - Horse vs.
  Cytochrome [illegible] - Desulfovibrio         3.3     3.9     3.9    0.2    2.0       2.6
Beta2-microglobulin - Human vs. Ig mu chain
  C4 homology region - Human Gal                 3.3    [illegible]  4.8   3.6    3.9       4.8
Ig mu chain C4 homology region - Human
  Gal vs. Ig epsilon chain C4 homology
  region - Human Nd                             10.1    11.5    12.1    4.7   11.2      11.9

Average score                                    5.0     5.1     5.9    3.0    4.7       5.5

[illegible] bias of 6 was used with MDM67 and MDM69 because [illegible] UM.
Table 27

Comparison of Mutation Data Matrices for Calculating Segment Comparison Scores

                                                Scores (in SD units) Obtained with
                                                                                        Off-diagonal
                                                                          Diagonal      and Averaged
                                                                          Only          Diagonal
Sequences Compared                              MDM67   MDM69   MDM78   UM     MDM78     MDM78

Cytochrome c6 - Monochrysis vs.
  Cytochrome [illegible] - Rhodospirillum        2.9     2.6     3.5    4.7    3.0       3.2
Azurin - Bordetella vs.
  Plastocyanin - French bean                     3.8     3.7     4.1    1.6    2.5       2.2
Ferredoxin - Clostridium pasteurianum vs.
  Ferredoxin - Desulfovibrio                     5.4     5.3     6.0    3.9    5.2       3.9
Troponin C - Rabbit vs. Parvalbumin - Pike      10.0     9.8    10.2    7.6    4.8      11.6
Troponin C - Rabbit vs.
  Myosin A1 light chain - Rabbit                14.8    13.4    15.1    8.0    9.0      13.3

Internal Duplication
Tropomyosin alpha chain - Rabbit                 7.8     5.9     8.3    5.9    4.7       8.8
Protease inhibitor, submandibular gland - Dog    6.5    [illegible]  7.9   4.1    4.2       6.6
Cytochrome c3 - Desulfovibrio gigas              1.7     3.8     3.9    0.5    3.2       1.6
Ferredoxin - C. pasteurianum                     7.1     7.3     7.7    7.8    7.3       7.5

Average score                                    6.7     6.5     7.4    4.9    4.9       6.6

We used a segment length of 16 residues for the [illegible] and ferredoxin internal duplication comparisons [illegible] for the other comparisons. Three hundred randomized sequence comparisons were used in determining scores for matrices except UM, for which 100 randomized sequence comparisons were used.



elements equal to 60 (the approximate average value for
diagonal elements in MDM78). These components can be
thought of as intermediate between UM and MDM78. The
first has zero for all off-diagonal elements; however, the
pattern of MDM78 is retained in the diagonal elements. In
the other modification, all of the diagonal elements are
equal as in UM, but the nondiagonal elements from
MDM78 are retained. In Table 26, the results from program
ALIGN are shown, and in Table 27, the results from
program RELATE are shown. MDM78 is clearly superior
to the earlier ones, which were based on less data. The
main differences are in the off-diagonal elements,
particularly the negative ones for which there was previously
little data.

Both amino acid mutabilities (diagonal elements) and
their exchange probabilities (off-diagonal elements) are
important aspects of MDM78. Comparison of the results
using the diagonal elements only with those using UM
shows that the pattern in the diagonal elements is helpful
in calculating alignment scores. In calculating segment
comparison scores, the diagonal elements are not, on
the average, an improvement over UM. However, diagonal
elements are helpful in some cases, such as in detecting
the internal duplication in cytochrome c3. Comparison
of UM with the matrix containing the off-diagonal and
the averaged diagonal elements shows that the off-diagonal
terms contribute more to the good results than
the variability in the diagonal terms. The MDM78 that
contains both patterns is clearly superior to either of the
partial matrices.
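
The two partial matrices described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names and the 3 x 3 toy matrix are hypothetical, the toy values are not taken from MDM78, and the diagonal is replaced by its exact mean rather than by the rounded value of 60 quoted in the text.

```python
def diagonal_only(matrix):
    """Keep the diagonal pattern; set every off-diagonal element to zero."""
    n = len(matrix)
    return [[matrix[i][j] if i == j else 0 for j in range(n)] for i in range(n)]


def off_diagonal_with_averaged_diagonal(matrix):
    """Keep the off-diagonal pattern; make all diagonal elements equal to their mean."""
    n = len(matrix)
    diag_mean = sum(matrix[i][i] for i in range(n)) / n
    return [[diag_mean if i == j else matrix[i][j] for j in range(n)] for i in range(n)]


# Toy symmetric matrix with hypothetical values (NOT the real MDM78).
toy = [
    [50, 10, -20],
    [10, 70, 5],
    [-20, 5, 60],
]

diag_variant = diagonal_only(toy)
offdiag_variant = off_diagonal_with_averaged_diagonal(toy)
```

Scoring the same sequence pairs with `toy`, `diag_variant`, and `offdiag_variant` in turn would mirror, on a small scale, the kind of comparison reported in Tables 26 and 27.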
24 Duplications in Protein Sequences

W.C. Barker, L.K. Ketcham, and M.O. Dayhoff



We have reported a routine procedure for screening protein
sequences for evidence of intragenic duplication. At
that time we tested 163 protein sequences from the 116
superfamilies of unrelated proteins listed in the Atlas
Supplement 2, chapter 2. Twenty superfamilies were found
to contain proteins that reflect internal gene duplications.
The intragenic duplications were of two major types:
(1) one or more duplications of all or part of a gene to
produce a protein with two or several regions of detectable
sequence homology, and (2) repeated replication of a small
DNA segment to produce a protein that is repetitive over
most of its length. These duplications had occurred in
prokaryotes and eukaryotes over a wide span of evolutionary
history, from ... years ago in an early anaerobic bacterium
(clostridial-type ferredoxin) to very recently in the human
line since the divergence of other primates (the haptoglobin
alpha-2 chains). We have extended this study by testing 117
of the sequences that appear in this volume, one from each
of the new protein superfamilies and families listed in
Table 1.



Computer Method 

The computer program RELATE, described in chapter 1,
can be used to detect repeated patterns within a protein
sequence. The program compares every possible segment
of a given length with every other segment of that length
within the sequence. A segment score is accumulated from
comparisons of the amino acids occupying corresponding
positions within the two segments. No gaps are permitted
within the segments being compared. A matrix of comparison
scores for each amino acid pair is supplied.

A numerical property of the distribution of segment
scores is determined for the real sequence and for at least
100 permuted sequences with the same amino acid composition.
The segment comparison score is calculated as
the difference between the value determined for the real
sequence and the average value determined from all of the
permuted sequences, divided by the standard deviation of
the values from the permuted sequences. The segment
comparison score is thus expressed in SD units, and the
probability of occurrence by chance of a score higher
than a particular value can be obtained from the cumulative
standardized normal distribution table. A score > 3.0
SD (P < 0.0014) is taken as indicative of internal duplication.
For the numerical property, we have used the mean
of a predetermined number of highest segment scores.
This was determined for each protein from the segment
length (s) and the total length of the sequence (L) as
L/2 - s + 1. This expression is equal to the number of
scores to be expected from comparisons of corresponding
segments if the sequence has exactly doubled.
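
The calculation just described can be illustrated with a minimal Python sketch. It is not the RELATE program: the function names are hypothetical, a simple identity score (1 for a match, 0 otherwise) stands in for the supplied matrix of comparison scores, and comparing every pair of segments, overlapping ones included, is an assumption made here.

```python
import math
import random
from statistics import mean, stdev


def identity_score(a, b):
    """Hypothetical stand-in for the scoring matrix: 1 for a match, else 0."""
    return 1 if a == b else 0


def n_top_scores(length, s):
    """Number of highest scores to average: L/2 - s + 1 (at least 1)."""
    return max(1, length // 2 - s + 1)


def top_mean(seq, s, score=identity_score):
    """Mean of the highest ungapped segment scores within one sequence."""
    starts = range(len(seq) - s + 1)
    scores = [
        sum(score(seq[i + k], seq[j + k]) for k in range(s))
        for i in starts
        for j in starts
        if i < j
    ]
    scores.sort(reverse=True)
    return mean(scores[: n_top_scores(len(seq), s)])


def segment_comparison_score(seq, s, runs=100, seed=0, score=identity_score):
    """(real value - mean of permuted values) / SD of permuted values."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    real = top_mean(seq, s, score)
    letters = list(seq)
    permuted = []
    for _ in range(runs):
        rng.shuffle(letters)  # same composition, random order
        permuted.append(top_mean(letters, s, score))
    return (real - mean(permuted)) / stdev(permuted)


def chance_probability(z):
    """Upper-tail probability of the standard normal distribution."""
    return 0.5 * math.erfc(z / math.sqrt(2))


# An exactly doubled toy sequence of 40 residues, segment length 8.
doubled = "ACDEFGHIKLMNPQRSTVWY" * 2
z = segment_comparison_score(doubled, s=8)
```

For the toy sequence, L/2 - s + 1 = 13, which is exactly the number of segment pairs at a displacement of 20 that match perfectly. `chance_probability(3.0)` evaluates to about 0.00135, in line with the P < 0.0014 quoted for the 3 SD threshold.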

Several other kinds of output are obtained from the
computer program. An ordered list of many segment
comparisons giving the highest scores is produced, so that
the regions of the sequence showing unusual similarity can
be identified. From the list, a table of displacements of
the matching segments, giving the frequency of occurrence
and average score for each displacement, is prepared. A
protein that has duplicated will have many high scores at a
displacement of half of its total length. A protein with
prominent 10-residue periodicity will have many high
scores at displacements of 10, 20, 30, etc.
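
A table of displacements of this kind can be sketched as follows. The input layout, a list of (start of first segment, start of second segment, score) triples, and the function name are assumptions made for illustration; the example values are invented.

```python
from collections import defaultdict


def displacement_table(top_comparisons):
    """Return {displacement: (frequency, average score)} for segment pairs.

    top_comparisons: iterable of (start_i, start_j, score) triples for the
    highest-scoring segment comparisons (hypothetical input layout).
    """
    totals = defaultdict(lambda: [0, 0.0])  # displacement -> [count, score sum]
    for start_i, start_j, score in top_comparisons:
        d = abs(start_j - start_i)
        totals[d][0] += 1
        totals[d][1] += score
    return {d: (count, total / count) for d, (count, total) in sorted(totals.items())}


# Invented example: high scores clustering at displacements of 10, 20, and 30.
example = [(0, 10, 12.0), (3, 13, 14.0), (5, 25, 9.0), (1, 31, 11.0), (7, 27, 13.0)]
table = displacement_table(example)
```

A duplicated protein would show a single dominant entry near half its length, whereas a periodic protein would show entries at multiples of the period, as in the invented example.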

Although the segment comparison score provides an
easy and straightforward criterion of internal duplication,
it may fail to detect duplications in certain cases. If a
duplication involves only a small region, it
may not be detected unless it is composed of amino acids
that are usually highly conserved in proteins. We have
purposely designed the procedure to detect major duplication
and extensive periodicity. If too many changes have
occurred in the sequence, an ancestral duplication may not
be detected.

Table 28

Proteins with Previously Detected Duplications

Score(a)      Length of    Approximate Length    Number of      Percent
(SD units)    Protein      of Repeat             Repetitions    Repetitious

Eukaryote Sequences
Collagen alpha 1 chain (rat)
Lipid-binding protein A-I (human)
Keratin B2A (sheep)
Keratin B-IIIA3 (sheep)
Alpha s1 casein (sheep)
Bombinin (unk)
Immunoglobulin mu chain C region (human)
Immunoglobulin epsilon chain C region (human)
Immunoglobulin gamma chain C region (guinea pig)
Serum albumin (human)
Sperm histone (bovine)
Histone H3 (bovine)
Haptoglobin alpha 2 chain (human)
Troponin C, skeletal muscle (rabbit)
Myosin A1 light chain (rabbit)
Parvalbumin (pike)
Lipid-binding protein C-I (human)
Prothrombin (bovine)
Neurophysin 2 (pig)
Posterior pituitary peptide (bovine)
Alpha crystallin A chain (bovine)
Protease inhibitor, Bowman-Birk (soybean)

Prokaryote Sequences
Cytochrome c3 (Desulfovibrio desulfuricans)
Cytochrome c7 (Desulfuromonas acetoxidans)
Cytochrome c3 (Desulfovibrio vulgaris)
Murein-lipoprotein (Escherichia coli)
Ferredoxin (Clostridium pasteurianum)
Ferredoxin (Chromatium sp.)
Rubredoxin (Pseudomonas oleovorans)



13.3 b 

11.4 
S.6 
4.0 
13 C 

• 3.8 d 
19.1 
1S-3 
9.4 
12.1 b 
3.7* 
3.6 



1,052 
245 
171 
131 
199 
24 
452 
423 
329 
584 
47 
13S 



3 
11 
10 
10 
20 
4 
108 
108 
108 
195 6 
8 

I 9 
M3 



37.1     143    59
7.5      159    76*
7.6*     190    76*
3.6      108    39
3.3*     57     27e
10.8h    582    79
4.9      92     23
4.0*     48     11
4.0*     173    30
3.9      71     28
4.3*     102    17
4.1      68     18
3.7      107    50*
8.2      58     14*
6.8      55     28
3.3j     81     28
3.5      174    55



337 ■ 

18 

13 

11 

>4 
4 
4 
4 
3 
3 
3 
(3 



2 

2 

4 
3 
2 

2.5 
2 
2 
2 



96% 
81% ■ 
76% 
.84% 
7 

67%. 

96% 
100% 

98% 
100% 

51% 

39% 

83% 
96% 

ao% 

72% 
95% 
27% 
50% 
46% 
35% 
79% 

67% 
79% 
93% 
50% 
100% 
69% 
63% 



Modified from Table 5 in Barker, W.C., Ketcham, L.K., and
Dayhoff, M.O., J. Mol. Evol. 10, 265-281, 1978.

a Unless otherwise noted, scores were determined using the
1973 mutation data matrix, a segment length of 20, and 100
random runs.
b Score based on testing residues 1-500.
c Score obtained using segment length of 16 and 300 random runs.
d Score obtained using segment length of 8.
e The major repeating unit has ... structure within itself.
f Score based on testing residues 31-190.
g ...
h Score based on testing residues 1-323.
i Score obtained using segment length of 12 and 300 random runs.
j Score obtained using segment length of 8 or 14.
In this study, the segment length used initially was 15
for sequences 50 residues or longer, 10 for sequences of
30-49 residues, and 8 for shorter sequences. Other
segment lengths can give higher segment comparison scores
when insertions or deletions have occurred since the
duplication. The mutation data matrix shown in Figure 84
was used.
Results

In Table 28 we have listed the proteins found previously
to have duplications detectable by this procedure. The new
proteins with duplications detected in the present
investigation are shown in Table 29. ... represent new
families in ... previously ... very ... in the amino-terminal
portion of prothrombin (see Alignment 9 and Figure 2...).
... light chain share the two ...
(see Alignment 30 and Figures 45 and 46).



Three of the proteins belong to superfamilies that
previously did not contain examples of detectable
duplications. Submandibular gland protease inhibitor and
ovomucoid share the duplication that produced a double-headed
sequence compared with the related pancreatic
secretory trypsin inhibitor, which contains only one
homology region. Subsequently, a partial duplication
added another homology region to the ovomucoid sequence
(see Alignment ... and Figure 35). The high potential
iron-sulfur protein from Rhodopseudomonas gelatinosa is
distantly related over its entire length to proteins from
... they are seen to have several insertions that disrupt the


Table 29 

New Proteins with Detected Duplications 



Sequence

Plasminogen (human)
Ovomucoid (Japanese quail)
Protease inhibitor, submandibular gland (dog)
Calcium-dependent regulator protein (bovine)
Myosin DTNB light chain (rabbit)
Tropomyosin alpha chain (rabbit)
Troponin T, skeletal muscle (rabbit)
Immunoglobulin alpha-1 chain C region (human)
High potential iron-sulfur protein
(Rhodopseudomonas gelatinosa)

a Unless otherwise noted, scores were determined with a segment
length of 15 and 100 random runs.
b ... within itself.
c Score obtained using segment length of 25.
Table 30

Superfamilies Containing Duplicated Proteins

                                  Families

Detectable         Not             Probable
Duplications       Duplicated      Duplications



2 Cytochrome c3 related
6 Ferredoxin related
7 High potential iron-sulfur protein
8 Rubredoxin
39 Prothrombin related
54 Protease inhibitors (PSTI-type)
58 Protease inhibitors (Bowman-Birk type)
64 Posterior pituitary peptide
81 Hemolytic peptides
89 Immunoglobulin C regions and related proteins
94 Histone H3
97 Sperm histone
139 Alpha crystallin
141 Keratin high-sulfur fraction B2 related
143 Collagen
147 Tropomyosin
149 Troponin C related
150 Troponin T
152 Animal lipid-binding proteins
154 Murein-lipoprotein
155 Alpha s1 casein
161 Neurophysin
164 Haptoglobin alpha chain
169 Serum albumin



7 
2 



1 
2 



duplications can explain the positions of the high-scoring
segments. However, the sequence contains a
number of acidic and basic residues (see Table 33)
consistent with a history of gene duplication.

In Table 30 we have listed all of the superfamilies that
contain proteins with detected duplications. In most cases
where more than one family (a group of proteins that are
less than 50% different in sequence) is known in a
superfamily, the duplication is either not detectable or
not present in some of the families. Sequences from more
than one family are known in 12 of the superfamilies
containing duplicated proteins; altogether 55 families are
represented in these 12 superfamilies. Clearly distinguishable
duplicated regions are seen in 26 of these families. Of the
other 29, 16 of the families lack the duplication whereas
13 have sequences of approximately the same length as
the duplicated sequences and show weak evidence of
duplication, but the sequence information has been degraded
by accumulated point mutations, insertions, and deletions.
In all, about 12% of 314 families contain proteins
with detected duplications.

References 
25 Composition of Proteins



, mbar of residues o* each kind for ail 



34 .re obtamed. Tabla 33 .s U n of 
characterized sequence named mthe £ > 

chapter 2 and described in *e " 2 , ,„d 
Jin*— i972. Vo.ume ,8 

f • 3. Th«e **• 9rouP=d imp 31* te»*-o Q ^ 

<~ ^ so* different from ~ U diKenna 

bY » lean 5% from other Mque „« 5 a* 

«^ M »y B ^'l S und«=^.ined residues » * 
proinsulin values. «« J 1 ".^ to „,„, argmmea 
end, of *• Coeptldea ware ^ um *J No eomplate 

, w u«n« o» *• » lta *" r . composite 

Table 31 



shown is *« Of *e fir« prore , ^ei 

*" ^-rr'^so *,fdS reiatad ^ 
on the data P*-» ™ ,| ph3 betmd Mb «' 
are adiacent to each other. An a* 





?.£*.% of .mine add Un tam-Y 1. »"« 
l 1 ssu mm«t«erd,e314f«m,l«. 

a pool of 314 wquwas. «acn feu . 

tions is shown in Table 32- 

Table 32 

Average Percent of Amino Acid Groups 
in Proteins 



. — i «f 31 4 lequeflcM. one lof w** 



Small aliphatic        A+G            16.9
Hydroxyl               S+T            13.1
Acidic                 D+E            11.6
Acidic + acid amide    D+B+N+E+Z+Q    19.8
Basic                  K+R+H          13.3
Hydrophobic            L+V+I+M        20.2
Aromatic               F+Y+W           8.8
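
Group percentages of the kind summarized in Table 32 can be computed for any single sequence with a short Python sketch. The group memberships below follow the one-letter formulas as we read them from the table and should be treated as assumptions; the function name and the toy sequence are hypothetical, and the output is not comparable to the averaged values over the 314 families.

```python
# Group formulas as read from Table 32 (assumed memberships, one-letter codes).
GROUPS = {
    "A+G": "AG",
    "S+T": "ST",
    "D+B+N+E+Z+Q": "DBNEZQ",
    "K+R+H": "KRH",
    "L+V+I+M": "LVIM",
    "F+Y+W": "FYW",
}


def group_percentages(sequence):
    """Percent of residues in each amino acid group for one sequence."""
    seq = sequence.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    total = len(seq)
    return {
        name: 100.0 * sum(seq.count(residue) for residue in members) / total
        for name, members in GROUPS.items()
    }


# Hypothetical 20-residue toy sequence, for illustration only.
toy_sequence = "AGSTDEKRLVFWAGNQHIMY"
percents = group_percentages(toy_sequence)
```

Averaging such per-sequence percentages over one representative sequence per family would give figures of the same kind as those tabulated, provided the same representatives are used.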